ABSTRACT

In the present work, we consider the problem of estimating the prior
probabilities $q_k$ of a mixture of known density functions $f_k(X)$, based
on a sequence of N statistically independent observations.

The mixture density is :

$$ g(X \mid Q) = \sum_{k=1}^{M} q_k f_k(X), \qquad Q = (q_1, \ldots, q_M)^T $$

It is shown that for very mild restrictions on $f_k(X)$, the maximum
likelihood estimate of Q is asymptotically efficient.

However, it is difficult to implement. Hence, a recursive algorithm for
estimating Q is proposed, analyzed, and optimized.

For the M=2 case, it is possible for the recursive algorithm to achieve
the same performance as the Maximum Likelihood one.

For M>2, slightly inferior performance is the price for having a recursive
algorithm. However, the loss is computable and tolerable.

Introduction

In many pattern classification problems, the probability density function 
of each class is known accurately, while the prior probabilities of the 
classes are unknown. 

There are instances where the estimation of prior probabilities from 
unclassified observations is the ultimate purpose of the data processing. 

This situation occurs in machine processing of remotely sensed Earth 
Resources data. 

The probability density functions of the spectral signatures of the several 
crops are known, defined in the multidimensional observation space. The 
objective is the accurate estimation of the proportions of the crops in a 
given area. 

In Section I, the general problem of joint classification of a set of
observations and estimation of prior probabilities is formulated. In a
related work by the author [4], the problem of simultaneous optimal
classification and recursive estimation of the prior probabilities has
been considered.

Here, the assumption is that we do not care about the individual
classification of each observation; we are interested only in a good
estimate of the prior probabilities.

The method proposed in the present work is recursive in nature, has
guaranteed fast convergence of the error variance at a rate that can be
computed, and achieves the Rao-Cramer lower bound in the two-class case.

We impose only certain mild constraints on the probability density
functions.

I. Likelihood Function

Let $X^N = (X_1, \ldots, X_N)$ be a sequence of statistically independent
observations. Each observation $X_i \in E^n$ is distributed according to
$f_k(X_i)$ under hypothesis $H_k$, $k = 1, \ldots, M$. The probability
density functions $f_k(X)$, $k = 1, \ldots, M$, are assumed continuous and
positive for every $X \in E^n$.

Let

$$ K_i^j = \begin{cases} 1 & \text{if } X_i \in H_j \\ 0 & \text{if } X_i \notin H_j \end{cases} $$

Let

$$ K_i = (K_i^1, K_i^2, \ldots, K_i^M) $$

Then $K_i$ is an M-vector with M-1 zeros and a 1 in the j-th position if
$X_i \in H_j$. Thus $K_i$ indicates the class membership of $X_i$.

Let

$$ K^N = (K_1, \ldots, K_N)^T $$

Then $K^N$ is an N x M matrix whose i-th row is $K_i$. It indicates the
class memberships of the observations $(X_1, \ldots, X_N)$.

Let $\pi = (\pi_1, \ldots, \pi_M)$ be the vector of prior probabilities of
the M classes.

We are interested in determining the conditional likelihood function
$P(X^N, K^N \mid \pi)$.

We have, by the Bayes rule,

$$ P(X^N, K^N \mid \pi) = P(X^N \mid K^N, \pi) \, P(K^N \mid \pi) = P(X^N \mid K^N) \, P(K^N \mid \pi) $$

The above conditional probability density functions are

$$ P(X^N \mid K^N) = \prod_{i=1}^{N} \prod_{s=1}^{M} [f_s(X_i)]^{K_i^s} $$

$$ P(K^N \mid \pi) = \prod_{i=1}^{N} \prod_{s=1}^{M} \pi_s^{K_i^s} $$

Substituting, we have :

$$ P(X^N, K^N \mid \pi) = \prod_{i=1}^{N} \prod_{s=1}^{M} [\pi_s f_s(X_i)]^{K_i^s} $$


In general, both $K^N$ and $\pi$ may be unknown.

It is interesting to note that the pair $(K^N, \pi)$ that maximizes
$P(X^N, K^N \mid \pi)$ has the following intuitively nice properties.

For known $\pi$, the value $K^N = \hat{K}^N$ that maximizes
$P(X^N, K^N \mid \pi)$ reduces to the Bayes classifier, i.e.

$$ \hat{K}_i^j = \begin{cases} 1 & \text{if } \pi_j f_j(X_i) = \max_m \pi_m f_m(X_i) \\ 0 & \text{otherwise} \end{cases} $$

For known $K^N$, the value $\pi = \hat{\pi}$ that maximizes
$P(X^N, K^N \mid \pi)$ is the relative frequency estimate, i.e.

$$ \hat{\pi} = N^{-1} \sum_{i=1}^{N} K_i $$

Hence the estimate

$$ (\hat{K}^N, \hat{\pi}) = \arg\max P(X^N, K^N \mid \pi) $$

is intuitively appealing but complicated to realize.

In the present work, we are not interested in estimating $K^N$. We are
only interested in estimating $\pi$. If $K^N$ is known, the relative
frequency estimate is unbiased :

$$ E \hat{\pi}_s = N^{-1} \sum_{i=1}^{N} E K_i^s = \pi_s $$

The error covariance matrix has elements

We are interested in finding the value of $\pi$ that will maximize the
conditional likelihood function

$$ P(X^N \mid \pi) $$

Let

$$ g(X \mid \pi) = \sum_{s=1}^{M} \pi_s f_s(X) $$

The function $g(X \mid \pi)$ is linear in the unknown parameters

$$ \pi = (\pi_1, \ldots, \pi_M) $$

In the present section, we will concentrate on the M = 2 class case.
In this case, the parameter $\pi$ is one dimensional :

$$ g(X \mid \pi) = \pi f_1(X) + (1 - \pi) f_2(X) $$

We make the following assumptions on $f_1$, $f_2$ :



Comment :

It has been shown that most of the usual probability density
functions make identifiable mixtures. In [5], there is a list of such
p.d.f.'s.

Because of the convenient form of the function $g(X \mid \pi)$, we are able
to use a theorem due to Cramer [6] regarding the behavior of the
maximum likelihood estimate

$$ \hat{P}_N = \arg\max_{\pi} P(X^N \mid \pi) $$


In general, the log-likelihood function

$$ \log P(X^N \mid \pi) $$

has a number of local maxima.

The local maxima are solutions of the likelihood equation :

$$ \frac{\partial}{\partial \pi} \log P(X^N \mid \pi) = 0 $$

The original version of the theorem requires the satisfaction of Conditions
1-5, due to Cramer [6].

If Conditions 1-5 are satisfied, any solution of the likelihood equation
will be a "good" estimate, in a sense to be defined.

For numerical solution of the likelihood equation, it would make things
easier if we knew that the likelihood equation has a unique solution.
Conditions 6-7, due to Perlman [7], guarantee that for large enough N,
and with probability 1, we will have a unique solution of the likelihood
equation.



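The conditions that must be satisfied are :

Condition 1 :

For almost all $X \in E^n$, the derivatives

$$ \frac{\partial^i}{\partial q^i} \log g(X \mid q), \qquad i = 1, 2, 3 $$

exist for all $q \in [0, 1]$.

Condition 2 :

$$ E \left[ \frac{\partial}{\partial q} \log g(X \mid q) \right]_{q = \pi} = 0 $$

where $\pi$ = true value of the prior probability.

Condition 3 :

$$ J(\pi) = E \left[ \left( \frac{\partial}{\partial q} \log g(X \mid q) \right)^2 \right]_{q = \pi} < \infty $$

Condition 4 :

$$ E \left[ \frac{\partial^2}{\partial q^2} \log g(X \mid q) \right]_{q = \pi} = -J(\pi) $$

Condition 5 :

There exists a function m(X), such that

$$ \left| \frac{\partial^3}{\partial q^3} \log g(X \mid q) \right| < m(X), \qquad \forall q \in [0, 1] $$

and m(X) is finite.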
Condition 6 :

The Kullback-Leibler information number

$$ I(q, \pi) = \int_{E^n} g(X \mid \pi) \log \frac{g(X \mid \pi)}{g(X \mid q)} \, dX $$

achieves a unique minimum at $q = \pi$.

Condition 7 :

$\frac{\partial}{\partial q} \log g(X \mid q)$ is continuous in q for each
$q \in [0, 1]$, uniformly in X.

Theorem :

Under the regularity Conditions 1-7, the maximum likelihood estimate

$$ \hat{P}_N = \arg\max_{q} \prod_{m=1}^{N} g(X_m \mid q) $$

is weakly consistent, i.e.

$$ \lim_{N \to \infty} \hat{P}_N = \pi \quad \text{in probability} $$

Furthermore, the estimate $\hat{P}_N$ is asymptotically efficient, i.e., it
achieves the Rao-Cramer lower bound :

$$ E (\hat{P}_N - \pi)^2 \rightarrow N^{-1} [J(\pi)]^{-1} $$

Also, with probability 1 there exists an $N_0$, such that for all $N > N_0$
the likelihood equation has a unique solution in the region $\pi \in [0, 1]$.

Intuitively speaking, the theorem says that for N "large enough,"
we will have in [0,1] a unique solution of the likelihood equation.

Hence, if $N_0$ is known, we can use an efficient numerical method
specifically designed to seek the unique zero of a function.

For the particular problem considered here, we have

$$ J(\pi) = \int_{E^n} [f_1(X) - f_2(X)]^2 \, [\pi f_1(X) + (1 - \pi) f_2(X)]^{-1} \, dX $$

In Appendix I, it is shown that Assumption 1 implies that $J(\pi)$ is
upper bounded by $[\pi(1 - \pi)]^{-1}$.

Hence, for $\pi \neq 0, 1$, $J(\pi)$ is finite. The physical significance of
this bound is the following.

The quantity $\pi(1 - \pi)$ is N times the variance of the relative
frequency estimate in the case of observations of known classification.

Hence the inequality

$$ N^{-1} [J(\pi)]^{-1} \geq N^{-1} \pi (1 - \pi) $$

is natural. It means that the Rao-Cramer lower bound (left-hand
expression) is at least as high as the variance of the relative frequency
estimate.

We have to accept the higher error variance because the observed data
are unclassified.

In Appendix I, it is also shown that the function

$$ A(\pi) = [J(\pi)]^{-1} $$

is concave in the region [0,1].
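As an illustration of the bound and the concavity property stated above, the following minimal sketch approximates $J(\pi)$ by a midpoint rule for two univariate Gaussian densities. The densities N(0,1) and N(2,1), the integration range, and the function names are assumptions made for this example only; they do not come from the paper.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fisher_information(pi, f1, f2, lo=-15.0, hi=15.0, steps=30000):
    """Midpoint-rule approximation of
    J(pi) = integral of (f1 - f2)^2 / (pi*f1 + (1-pi)*f2) dX over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        a, b = f1(x), f2(x)
        g = pi * a + (1.0 - pi) * b
        if g > 0.0:  # skip points where both densities underflow to zero
            total += (a - b) ** 2 / g
    return total * h


# Hypothetical example densities (assumed for illustration).
f1 = lambda x: normal_pdf(x, 0.0, 1.0)
f2 = lambda x: normal_pdf(x, 2.0, 1.0)

if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5, 0.7, 0.9):
        j = fisher_information(p, f1, f2)
        print(p, j, 1.0 / (p * (1.0 - p)))  # J(pi) next to its upper bound
```

The gap between the two printed columns shrinks as the densities become better separated, which matches the interpretation of the bound as the known-classification case.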
In such a case, we assume that we know that $\pi$ lies in an interval
$I(\epsilon)$, where

$$ I(\epsilon) = \begin{cases}
[0, 1] & \text{if } J(0) < +\infty, \ J(1) < +\infty \\
[\epsilon, 1] & \text{if } J(0) = +\infty, \ J(1) < +\infty \\
[0, 1 - \epsilon] & \text{if } J(0) < +\infty, \ J(1) = +\infty \\
[\epsilon, 1 - \epsilon] & \text{if } J(0) = J(1) = +\infty
\end{cases} $$

and $\epsilon$ is a small positive number. Conditions 1-7 have to be valid
for $\pi \in I(\epsilon)$ in order for the theorem to apply.

In Appendix I, an efficient method for computing $J(\pi)$ in the case of
Gaussian densities is demonstrated.

In Appendix II, it is shown that Assumptions 1-2 imply the satisfaction of
Conditions 1-7.

Hence, the Maximum Likelihood estimate of $\pi$ is an efficient method in
terms of performance.

The implementation of the estimate requires finding the maximum of the
likelihood function, which is an N-th degree polynomial in $\pi$. For
large N, we cannot afford the computational complexity of the above scheme.
Furthermore, the M.L. estimate is non-recursive: we cannot update it
efficiently.

We will now consider a recursive estimate of the mixture parameter $\pi$.
The basic observation is that the value $q = \pi$ minimizes the
Kullback-Leibler information number $I(q, \pi)$, and the minimum is unique.

The derivative of $I(q, \pi)$ is :

$$ \frac{\partial}{\partial q} I(q, \pi) = -\int_{E^n} g(X \mid \pi) \frac{\partial}{\partial q} \log g(X \mid q) \, dX = -E \left\{ \frac{\partial}{\partial q} \log g(X \mid q) \,\Big|\, \pi \right\} $$

Hence, the estimate of the gradient of $I(q, \pi)$, for a fixed q and
based on one observation X, is :

$$ -\frac{\partial}{\partial q} \log g(X \mid q) $$

Motivated by the above observation, we consider the following sequential
estimation algorithm :

$$ P_{N+1} = P_N + N^{-1} L(P_N) \, G(X_{N+1}, P_N) $$

where G is the current estimate of the gradient :

$$ G(X_{N+1}, q) = \frac{\partial}{\partial q} \log g(X_{N+1} \mid q) = [f_1(X_{N+1}) - f_2(X_{N+1})] \, [q f_1(X_{N+1}) + (1 - q) f_2(X_{N+1})]^{-1} $$

and L(P) is a bounded positive function, defined for $P \in [0, 1]$.

L(P) will be chosen later for optimal convergence of the algorithm.

We define the regression function M(q), for $q \in [0, 1]$ :

$$ M(q) = E [L(q) \, G(X, q)] = L(q) \, F(q) $$

where

$$ F(q) = \int_{E^n} [f_1(X) - f_2(X)] \, [q f_1(X) + (1 - q) f_2(X)]^{-1} \, [\pi f_1(X) + (1 - \pi) f_2(X)] \, dX $$

The derivative of F(q) is :

$$ F'(q) = -\int_{E^n} [f_1(X) - f_2(X)]^2 \, [q f_1(X) + (1 - q) f_2(X)]^{-2} \, [\pi f_1(X) + (1 - \pi) f_2(X)] \, dX $$

Hence,

$$ F'(q) < 0 \qquad \forall q \in [0, 1] $$

Also, we note that

$$ F(\pi) = 0, \qquad M(\pi) = 0 $$

Therefore, the function F(q) is monotone decreasing in [0,1] and
it has a unique zero at $q = \pi$.

Let

$$ Z(X, q) = G(X, q) \, L(q) - M(q) $$

Obviously, the random variable Z(X,q) has zero mean, conditioned on q :

$$ E [Z(X, q) \mid q] = 0 $$

To guard against getting an estimate $P_{N+1}$ that is outside of the
interval [a, b], I put two reflecting barriers at a and b.

The recursive algorithm then becomes :

$$ \tilde{P}_{N+1} = P_N + N^{-1} [ Z(X_{N+1}, P_N) + M(P_N) ] $$

$$ P_{N+1} = R(\tilde{P}_{N+1}) $$

The function R(X) truncates to the extreme points of $I(\epsilon)$ any
estimate that falls outside.

If $I(\epsilon) = [a, b]$,

$$ R(X) = \begin{cases} b & \text{if } X \geq b \\ X & \text{if } X \in [a, b] \\ a & \text{if } X \leq a \end{cases} $$

This is standard procedure in algorithms of this type.
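The truncated recursion above can be sketched in a few lines. This is only an illustrative sketch: the Gaussian class densities N(0,1) and N(3,1), the constant gain L(P) = 0.5, the interval [0.01, 0.99], and all function names are assumptions chosen for the example, not choices made in the paper. Since Z + M = L G, the update is written directly in terms of the gradient term G.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def truncate(p, a, b):
    """R(X): clip an estimate to the interval [a, b]."""
    return min(max(p, a), b)


def recursive_prior_estimate(samples, f1, f2, gain, p0=0.5, a=0.01, b=0.99):
    """Apply P_{N+1} = R(P_N + N^{-1} L(P_N) G(X_{N+1}, P_N)) over all samples."""
    p = p0
    for n, x in enumerate(samples, start=1):
        d1, d2 = f1(x), f2(x)
        g = p * d1 + (1.0 - p) * d2
        if g <= 0.0:  # both densities underflowed; the sample is uninformative
            continue
        grad = (d1 - d2) / g  # G(X, p) = d/dp log g(X | p)
        p = truncate(p + gain(p) * grad / n, a, b)
    return p


def draw_mixture(n, pi, rng):
    """Hypothetical test data: N(0,1) with probability pi, otherwise N(3,1)."""
    return [rng.gauss(0.0, 1.0) if rng.random() < pi else rng.gauss(3.0, 1.0)
            for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    xs = draw_mixture(20000, 0.3, rng)
    est = recursive_prior_estimate(
        xs,
        lambda x: normal_pdf(x, 0.0, 1.0),
        lambda x: normal_pdf(x, 3.0, 1.0),
        gain=lambda p: 0.5,  # assumed constant L(P)
    )
    print(est)
```

Each observation is used once and then discarded, so the memory cost is constant in N, which is the practical advantage over re-solving the likelihood equation.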

For the convergence properties of the above sequential procedure, we now
invoke a theorem due to J. Sacks [8]. The conditions of the theorem are
expressed for convenience in the notation of the present paper.

They involve the regression function M(q) and the sequence of zero
mean, "noisy" observables $\{ Z(X_j, q) \}$.

Condition 1a :

$$ M(\pi) = 0 \quad \text{and} \quad (q - \pi) M(q) < 0 \ \text{ for all } q \in I(\epsilon), \ q \neq \pi $$

Condition 2a :

For all $q \in I(\epsilon)$ and some positive constant $K_1$,
$|M(q)| \leq K_1 |q - \pi|$, and for every $t_1, t_2$ such that
$0 < t_1 < t_2 < \infty$, $\inf |M(q)| > 0$, where the inf is taken for
$t_1 \leq |q - \pi| \leq t_2$, $q \in I(\epsilon)$.

Condition 3a :

For all $q \in I(\epsilon)$

$$ M(q) = a_1 (q - \pi) + \delta(q, \pi) $$

where $\delta(q, \pi) = o(|q - \pi|)$ as $|q - \pi| \to 0$
and where $a_1 < 0$.

Condition 4a :

a) $\sup_{q \in I(\epsilon)} E [Z^2(X, q) \mid q] < \infty$

b) $\lim_{q \to \pi} E [Z^2(X, q) \mid q] = S(\pi)$

Condition 5a :

(The version of this condition is stronger than necessary, but it is
easier to verify for our particular case).

For a fixed value of q, the random variables $\{ Z(X_N, q) \}$
are identically distributed.

Theorem :

(Sacks) Suppose that Conditions 1a-5a are satisfied, and assume in
addition that $|a_1| > 1/2$. Then $N^{1/2} (P_N - \pi)$ is asymptotically
normally distributed with mean 0 and variance
$S(\pi) [2 |a_1| - 1]^{-1}$.

In order to satisfy the Conditions 1a - 6a, we constrain the function
L(q) to be positive and bounded :

$$ 0 < C_1 \leq L(q) \leq C_2 < +\infty $$

Then,

$$ (q - \pi) M(q) = L(q) (q - \pi) F(q) < 0 \qquad \forall q \neq \pi, \ q \in I(\epsilon) $$

because the product $(q - \pi) F(q)$ is negative for all $q \neq \pi$.

In Appendix III, it is shown that Assumptions 1-2 imply satisfaction of
Conditions 1a - 6a.

It is also shown that the constants $a_1$ and $S(\pi)$ of the theorem are

$$ S(\pi) = L^2(\pi) J(\pi) $$

$$ a_1 = -L(\pi) J(\pi) = L(\pi) F'(\pi) $$

because :

$$ F'(\pi) = -J(\pi) $$

We are now able to express the asymptotic error variance of the
algorithm in terms of $L(\pi)$, $J(\pi)$ and under the condition

$$ 2 |a_1| = 2 L(\pi) J(\pi) > 1 $$

The variance is :

$$ N E (P_N - \pi)^2 \rightarrow J(\pi) L^2(\pi) [2 L(\pi) J(\pi) - 1]^{-1} $$

(If the condition $2 |a_1| > 1$ is not satisfied, Sakrison [ ] has
commented that the convergence rate may be slower than $N^{-1}$.)

For a fixed value of $\pi$, we have in Fig. 1 the variance

$$ V = J(\pi) L^2(\pi) [2 L(\pi) J(\pi) - 1]^{-1} $$

as a function of $L = L(\pi)$.

Fig. 1

For $L(\pi) > [2 J(\pi)]^{-1}$, the variance V has a global minimum,
achieved at

$$ L = J^{-1} $$

Hence, we can optimize the nonlinear function L by choosing

$$ L(\pi) = [J(\pi)]^{-1}, \qquad \pi \in I(\epsilon) $$

Substituting the optimum $L(\pi)$ into the variance expression, we find
that the resulting minimum asymptotic variance is :

$$ E (P_N - \pi)^2 \approx N^{-1} [J(\pi)]^{-1} $$

But this is exactly the Rao-Cramer lower bound, i.e., the sequential
procedure is asymptotically efficient.
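As a worked check of the optimization of L, the stationary point of the variance can be obtained by direct differentiation. The shorthand J = J(π) and L = L(π) is used; nothing beyond the variance expression given in the text is assumed.

```latex
V(L) = \frac{J L^2}{2 J L - 1}, \qquad L > \frac{1}{2J}

\frac{dV}{dL}
  = \frac{2 J L (2 J L - 1) - 2 J^2 L^2}{(2 J L - 1)^2}
  = \frac{2 J L (J L - 1)}{(2 J L - 1)^2}

\frac{dV}{dL} < 0 \ \text{for } \frac{1}{2J} < L < \frac{1}{J}, \qquad
\frac{dV}{dL} > 0 \ \text{for } L > \frac{1}{J}

\Rightarrow \quad L_{opt} = J^{-1}, \qquad
V(L_{opt}) = \frac{J \cdot J^{-2}}{2 - 1} = J^{-1}
```

The minimum value $J^{-1}$, divided by N, is the Rao-Cramer bound quoted in the theorem for the maximum likelihood estimate.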
In other words, if we agree that the mixture approach should be followed,
the sequential algorithm presented will perform as well as anything else
in estimating $\pi$.

The maximum likelihood estimation scheme requires tremendous complexity
in order to achieve the Rao-Cramer bound, while the presented sequential
scheme is very simple and achieves the same lower bound.

The only difficulty in the implementation lies in the construction of the
nonlinear function $L(\pi)$.

However, it is a one-shot construction, so we can do it off-line. In
situations where we have to estimate prior probabilities repeatedly, while
the probability density functions remain unchanged, the scheme is
increasingly attractive.

In Appendix I, an efficient method for constructing $J(\pi)$ (hence $L(\pi)$)
is presented for the case of multivariate Gaussian densities.

III. Mixture Approach : M > 2 Class Case

We now assume that each observation vector $X_i \in E^n$ comes from one
of M statistical populations (hypotheses).

Under hypothesis $H_m$, $X_i$ is distributed according to the p.d.f.
$f_m(X_i)$. Let $\pi_m$ be the prior probability of hypothesis $H_m$.

We need to estimate only M-1 of the prior probabilities.

Let $\pi = [\pi_1 \ldots \pi_{M-1}]^T$ be the vector of true prior
probabilities, and $Q = [q_1 \ldots q_{M-1}]^T$ be a vector of arbitrary
prior probabilities.

Let $g(X \mid Q)$ designate the mixture density :

$$ g(X \mid Q) = \sum_{s=1}^{M-1} q_s f_s(X) + \left[ 1 - \sum_{s=1}^{M-1} q_s \right] f_M(X) $$

The likelihood function of a sequence of N independent observations is :

$$ \prod_{m=1}^{N} g(X_m \mid Q) $$

We will now investigate the performance of the maximum likelihood estimate
of $\pi$, based on a sequence of N observations.

The M.L. estimate $\hat{Q}_N$ is determined by the equation :

$$ \hat{Q}_N = \arg\max_{Q \in I_M} \prod_{m=1}^{N} g(X_m \mid Q) $$

where

$$ I_M = \left\{ Q : Q = (q_1 \ldots q_{M-1}), \ q_s \geq 0, \ \sum_{s=1}^{M-1} q_s \leq 1 \right\} $$

We will make two mild assumptions about the densities $f_m(X)$, similar
to the ones for the M=2 case.

Assumption 1' :

The densities $f_m(X)$, $m = 1, \ldots, M$, are continuous and nonzero for
all $X \in E^n$.

Assumption 2' :

The densities $f_K(X)$, $K = 1, \ldots, M$, make an identifiable mixture
$g(X \mid Q)$.


For assessing the properties of the maximum likelihood estimate, we
will use the multidimensional version of the theorem used in Section II.
The parameter space now is M-1 dimensional.

Conditions 1'-5' of the following theorem are due to Cramer [6],
and Conditions 6'-7' are due to Perlman [7]. The last two Conditions
guarantee that for N "large enough," the likelihood equation will have
a unique solution in $I_M$ (region of interest).


Condition 1' :

For all $X \in E^n$, the derivatives

$$ \frac{\partial^{i+j}}{\partial q_s^i \, \partial q_K^j} \log g(X \mid Q), \qquad s, K = 1, \ldots, M-1 $$

exist for all $Q \in I_M$ and $i, j = 1, 2, 3$.

Condition 2' :

$$ E \left[ \frac{\partial}{\partial q_s} \log g(X \mid Q) \right]_{Q = \pi} = 0 \qquad \text{for } s = 1, \ldots, M-1 $$

where $\pi$ = true value of the prior probability.

Condition 3' :

$$ J_{sK}(\pi) = E \left[ \frac{\partial}{\partial q_s} \log g(X \mid Q) \cdot \frac{\partial}{\partial q_K} \log g(X \mid Q) \right]_{Q = \pi} < \infty \qquad \text{for } s, K = 1, \ldots, M-1 $$

Condition 4' :

$$ E \left[ \frac{\partial^2}{\partial q_s \, \partial q_K} \log g(X \mid Q) \right]_{Q = \pi} = -J_{sK}(\pi) \qquad \text{for } s, K = 1, \ldots, M-1 $$

Condition 5' :

There exists a function m(X), such that

$$ \left| \frac{\partial^{i+j}}{\partial q_s^i \, \partial q_K^j} \log g(X \mid Q) \right| < m(X) \qquad \forall Q \in I_M $$

for $i, j = 1, 2, 3$, $s, K = 1, \ldots, M-1$,
and m(X) is finite, except on a set of probability zero.

Condition 6' :

The Kullback-Leibler information number

$$ I(Q, \pi) = \int_{E^n} g(X \mid \pi) \log \frac{g(X \mid \pi)}{g(X \mid Q)} \, dX $$

achieves a unique minimum at $Q = \pi$.

Condition 7' :

$\frac{\partial}{\partial q_s} \log g(X \mid Q)$ is continuous at each
$Q \in I_M$, $s = 1, \ldots, M-1$, uniformly in X.

Theorem :

Under the regularity Conditions 1'-7', the maximum likelihood
estimate

$$ \hat{Q}_N = \arg\max_{Q} \prod_{m=1}^{N} g(X_m \mid Q) $$

is weakly consistent, i.e.

$$ \lim_{N \to \infty} \hat{Q}_N = \pi \quad \text{in probability} $$

Furthermore, the Maximum Likelihood estimate $\hat{Q}_N$ is asymptotically
efficient, achieving the Rao-Cramer lower bound.

Also, with probability 1, there exists an $N_0$, such that for all $N > N_0$,
the likelihood equation has a unique solution $\pi = (\pi_1 \ldots \pi_{M-1})$
in the region

$$ \left\{ \pi : 0 < \pi_i < 1, \ i = 1, \ldots, M-1, \ \sum_{k=1}^{M-1} \pi_k < 1 \right\} $$

Let

$$ R_N(\pi) = E (\hat{Q}_N - \pi)(\hat{Q}_N - \pi)^T $$

be the error covariance matrix.

Let $A = (a_1 \ldots a_{M-1})^T$ be any weighting vector with nonzero
norm.

Then the above property stated in the theorem can be expressed as :

$$ \lim_{N \to \infty} N^{-1} [A^T R_N^{-1}(\pi) A] = E [A^T \nabla \log g(X \mid \pi)]^2 $$

Hence, the maximum likelihood estimator $\hat{Q}_N$ performs better than any
other estimate.

In Appendix IV, an upper bound to the function $J_{sK}(\pi)$ is found.
The bound is :

$$ J_{sK}(\pi) \leq \pi_M^{-1} (\pi_K \pi_s)^{-1/2} [(\pi_K + \pi_M)(\pi_s + \pi_M)]^{3/2} $$

where $\pi_M = 1 - \sum_{K=1}^{M-1} \pi_K$.

This bound is finite for $\pi_s, \pi_K, \pi_M \neq 0$.

With arguments similar to those for the M = 2 case, it can be easily
shown that Assumptions 1'-2' imply the satisfaction of Conditions 1'-7'.

The conclusion is that the maximum likelihood estimate of $\pi$ "works"
for the mixture model.

The implementation of the maximum likelihood estimate of $\pi$ is
numerically difficult. With an increasing number of observations, N, the
computational complexity of the M.L. estimator increases tremendously.

Motivated by the difficulty in implementation, we will now propose and
analyze a recursive estimation procedure.

The intuitive basis is the minimization of the functional $I(Q, \pi)$ :

$$ I(Q, \pi) = E \{ \log [ g(X \mid \pi) (g(X \mid Q))^{-1} ] \mid \pi \} $$

The gradient of I with respect to Q is :

$$ \nabla I(Q, \pi) = E \{ \nabla \log [ g(X \mid \pi) (g(X \mid Q))^{-1} ] \mid \pi \} = -E \{ \nabla \log g(X \mid Q) \mid \pi \} $$

Therefore, an estimate of the gradient of $I(Q, \pi)$, based on one
observation X, is the vector $-\nabla \log g(X \mid Q)$, where

$$ \nabla \log g(X \mid Q) = [g(X \mid Q)]^{-1} [ f_1(X) - f_M(X), \ \ldots, \ f_{M-1}(X) - f_M(X) ]^T $$

This observation motivates the following gradient algorithm for recursive
estimation of $\pi$ :

$$ Q_{N+1} = Q_N + (N+1)^{-1} L(Q_N) \nabla \log g(X_{N+1} \mid Q_N) $$

Here, L(Q) is a scalar function of Q, positive and bounded between
$C_1$ and $C_2$ :

$$ 0 < C_1 \leq L(Q) \leq C_2 < +\infty $$

L(Q) will be adjusted later for optimal convergence of the algorithm.

In order to examine the convergence properties of the algorithm, we need
to define the regression function M(Q).

M(Q) is an (M-1)-dimensional vector function :

$$ M(Q) = -E \{ L(Q) \nabla \log g(X \mid Q) \mid Q \} $$

After substitution, we have

$$ M(Q) = [M_1(Q), \ldots, M_{M-1}(Q)]^T $$

where

$$ M_K(Q) = -L(Q) \int_{E^n} g(X \mid \pi) [g(X \mid Q)]^{-1} [f_K(X) - f_M(X)] \, dX, \qquad K = 1, \ldots, M-1 $$

We note that

$$ M_K(\pi) = 0 $$

hence

$$ M(\pi) = 0 $$

We define the random vector

$$ Z(X, Q) = -L(Q) \nabla \log g(X \mid Q) - M(Q) $$

We have :

$$ E ( Z(X, Q) \mid Q ) = 0 $$

We will define a region $I_M(A)$ in (M-1)-dimensional Euclidean space.
Let $A = (a_1 \ldots a_M)$, where the $a_K$ are positive numbers, much
smaller than 1. We define the region $I_M(A)$ as follows :

$$ I_M(A) = \left\{ Q : Q = (q_1 \ldots q_{M-1}), \ q_K \geq a_K, \ K = 1, \ldots, M-1, \ 1 - \sum_{K=1}^{M-1} q_K \geq a_M \right\} $$

We are now ready to apply a multidimensional stochastic approximation
theorem due to J. Sacks [ ]. The conditions of the theorem are
expressed in terms of the function M(Q) and the random variables
Z(X,Q).

Condition 1 :

$M(\pi) = 0$, and for every $\epsilon > 0$, $\inf (Q - \pi)^T M(Q) > 0$,
where the inf is taken over the region :

$$ I_M(A) \cap \{ Q : \epsilon \leq \| Q - \pi \| \leq \epsilon^{-1} \} $$

Condition 2 :

There exists a positive constant $K_1$, such that, for all $Q \in I_M(A)$,

$$ \| M(Q) \| \leq K_1 \| Q - \pi \| $$

Condition 3 :

For all $Q \in I_M(A)$,

$$ M(Q) = B (Q - \pi) + \delta(Q, \pi) $$

where B is a positive definite (M-1) x (M-1) matrix, and
$\| \delta(Q, \pi) \| = o( \| Q - \pi \| )$ as $\| Q - \pi \| \to 0$.

Condition 4 :

$$ \sup_{Q \in I_M(A)} E \{ \| Z(X, Q) \|^2 \mid Q \} < +\infty $$

$$ \lim_{Q \to \pi} E \{ Z(X, Q) Z^T(X, Q) \mid Q \} = S(\pi) $$

where $S(\pi)$ is a nonnegative definite matrix.

Condition 5 :

Conditioned on Q, the sequence of random variables
$Z(X_i, Q)$ is identically distributed.

Let $b_1, \ldots, b_{M-1}$ be the eigenvalues of B in decreasing order.
Write $B = P B_1 P^{-1}$, where P = orthogonal matrix and
$B_1 = \mathrm{diag}(b_1 \ldots b_{M-1})$.

Let $S_{ij}(\pi)$ = (i,j)-th element of $S(\pi)$
and $S^*_{ij}(\pi)$ = (i,j)-th element of
$S^*(\pi) = P^{-1} S(\pi) P$.

Theorem :

Suppose Conditions 1-5 are satisfied.

Assume, further, that $b_{M-1} > 1/2$.

Then, $N^{1/2} (Q_N - \pi)$ is asymptotically normal, with mean 0 and
covariance matrix $P F P^{-1}$, where F is the matrix whose (i,j)-th
element is

$$ (b_i + b_j - 1)^{-1} S^*_{ij}(\pi) $$
In Appendix V, it is shown that Assumptions 1-2 imply satisfaction of
Conditions 1-5 for the region $Q \in I_M(A)$.

Hence the proposed recursive estimation algorithm will converge to the
true value $\pi$, and the convergence of the error covariance is of the
order $N^{-1}$.

The reason for achieving high speed of convergence is that the stochastic
approximation theorem of Sacks was invoked.

It requires more stringent conditions for convergence than Blum's [9]
theorem, for example, and the reward is that a unique zero of the
regression function is guaranteed, hence we have speedy convergence.

In order to keep the sequence of estimates $\{ Q_N \}$ within the region
$I_M(A)$, for convergence purposes, we make a slight modification.

The new computed estimate is :

$$ Q'_{N+1} = Q_N + (N+1)^{-1} L(Q_N) \nabla \log g(X_{N+1} \mid Q_N) $$

We construct $Q_{N+1}$ from $Q'_{N+1}$ by truncating to the boundaries the
coordinates of $Q'_{N+1}$ that are outside of $I_M(A)$, so that

$$ Q_{N+1} \in I_M(A) $$
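The modified recursion can be sketched as follows for M = 3 univariate Gaussian classes. This is a sketch under stated assumptions, not the author's implementation: the class densities, the constant gain L(Q) = 0.5, the common margin a for all coordinates, and in particular the `project` rule are hypothetical. The text only says that coordinates outside $I_M(A)$ are truncated to the boundary; the rule below is one simple way to return to the region.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def project(q, a):
    """Move Q back into {q_k >= a, 1 - sum(q) >= a}.
    Hypothetical truncation rule; it requires (len(q) + 1) * a < 1."""
    q = [max(v, a) for v in q]
    excess = sum(q) - (1.0 - a)
    if excess > 0.0:
        slack = sum(v - a for v in q)  # total amount above the lower bounds
        q = [v - excess * (v - a) / slack for v in q]
    return q


def recursive_priors(samples, densities, gain, a=0.01):
    """Q_{N+1} = project(Q_N + (N+1)^{-1} L(Q_N) grad log g(X_{N+1} | Q_N))."""
    m = len(densities)
    q = [1.0 / m] * (m - 1)  # estimates of the first M-1 priors
    for n, x in enumerate(samples):
        f = [d(x) for d in densities]
        g = sum(qk * fk for qk, fk in zip(q, f)) + (1.0 - sum(q)) * f[-1]
        if g <= 0.0:  # all densities underflowed; skip the sample
            continue
        step = gain(q) / (n + 1)
        # k-th gradient component: (f_k - f_M) / g
        q = project([qk + step * (fk - f[-1]) / g for qk, fk in zip(q, f)], a)
    return q


if __name__ == "__main__":
    rng = random.Random(1)
    means, priors = [-3.0, 0.0, 3.0], [0.2, 0.5, 0.3]  # assumed example values
    dens = [lambda x, mu=mu: normal_pdf(x, mu, 1.0) for mu in means]
    xs = [rng.gauss(rng.choices(means, weights=priors)[0], 1.0)
          for _ in range(30000)]
    print(recursive_priors(xs, dens, gain=lambda q: 0.5))
```

The last prior is recovered as one minus the sum of the returned coordinates, which is why only M-1 values are carried through the recursion.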

In Appendix V, the error covariance matrix is computed. The result is
as follows :

Let $D(\pi)$ be an (M-1) x (M-1) matrix with elements

$$ D_{Ks}(\pi) = \int_{E^n} [g(X \mid \pi)]^{-1} [f_K(X) - f_M(X)] \, [f_s(X) - f_M(X)] \, dX $$

Let $d_1 \geq d_2 \geq \ldots \geq d_{M-1}$ be the eigenvalues of $D(\pi)$.

Let

$$ D(\pi) = P^{-1} \mathrm{diag}(d_1 \ldots d_{M-1}) P $$

where P = orthogonal matrix, consisting of the eigenvectors of $D(\pi)$.
Then, using the above theorem, it is found in Appendix V that the
asymptotic error covariance matrix is :

$$ \lim_{N \to \infty} N E (Q_N - \pi)(Q_N - \pi)^T = P F P^{-1} $$

where

$$ F(\pi) = L^2(\pi) \, \mathrm{diag} [ d_1 (2 L(\pi) d_1 - 1)^{-1}, \ \ldots, \ d_{M-1} (2 L(\pi) d_{M-1} - 1)^{-1} ] $$

The motivation for employing the recursive estimate was to achieve a
simpler estimate than the Maximum Likelihood one. It is expected that
the convenience of having a recursive estimate will be paid for in the
form of increased error variance.

The question is, how much performance did we sacrifice?

Furthermore, it seems at first glance that it might be possible to
recover some of the incurred loss by cleverly choosing the function $L(\pi)$.

In the case M=2, the loss was completely recovered, and the
Rao-Cramer bound was achieved with the use of the optimal function $L(\pi)$.
We will compare the performance of the following three estimators of $\pi$ :

A) Maximum Likelihood Estimator

B) Recursive Estimator

C) Relative Frequency Estimator

Actually, Estimator C can be implemented only when the data are observed
noiselessly.

This requirement is equivalent to the densities $f_K(X)$ having disjoint
support sets.

Therefore, comparison of Estimator C to the others is only an indication
of the loss in performance due to noisy data.

Let

$$ R^s(\pi) = \lim_{N \to \infty} N E (Q_N - \pi)(Q_N - \pi)^T $$

be the asymptotic error covariance of the estimator s.

The superscript s will indicate whether we have the A, B, or C estimator.
Let $A = (a_1 \ldots a_{M-1})^T$ be an arbitrary weighting vector with
nonzero norm.

The magnitude of the quantity

$$ [A^T [R^s(\pi)]^{-1} A]^{-1} $$

is indicative of the "magnitude" of the error covariance matrix. The
error covariance matrix of the recursive estimator satisfies the equation

$$ [R^B(\pi)]^{-1} = P^{-1} F^{-1} P $$

The Maximum Likelihood estimator achieves the Rao-Cramer lower bound,
hence :

$$ [A^T [R^A(\pi)]^{-1} A]^{-1} = \left[ E [A^T \nabla \log g(X \mid \pi)]^2 \right]^{-1} $$

We have

$$ E [A^T \nabla \log g(X \mid \pi)]^2 = A^T E \{ [\nabla \log g(X \mid \pi)] [\nabla \log g(X \mid \pi)]^T \} A $$

$$ = A^T D(\pi) A $$

$$ = A^T P^{-1} \mathrm{diag}(d_1 \ldots d_{M-1}) P A $$

$$ = A^T P^T \mathrm{diag}(d_1 \ldots d_{M-1}) P A $$

(because $P^{-1} = P^T$)

The matrix $D(\pi)$ is symmetric.

Hence,

$$ D(\pi) = D^T(\pi) = (P^T \mathrm{diag}(d_1 \ldots d_{M-1}) P)^T $$

$$ D(\pi) = P \, \mathrm{diag}(d_1 \ldots d_{M-1}) P^T $$

Using the above observations, we have :

$$ [A^T [R^A(\pi)]^{-1} A]^{-1} = [A^T P^T \mathrm{diag}(d_1 \ldots d_{M-1}) P A]^{-1} $$

$$ [A^T [R^B(\pi)]^{-1} A]^{-1} = [A^T P^T F^{-1} P A]^{-1} $$

where

$$ F^{-1} = [L(\pi)]^{-2} \, \mathrm{diag} [ (2 L(\pi) d_1 - 1) d_1^{-1}, \ \ldots, \ (2 L(\pi) d_{M-1} - 1) d_{M-1}^{-1} ] $$

We note now that each of the diagonal terms of $F^{-1}$ is no larger than
the corresponding $d_K$, because the inequality

$$ (2 L(\pi) d_K - 1) \, d_K^{-1} [L(\pi)]^{-2} \leq d_K $$

is equivalent to :

$$ (L(\pi) d_K - 1)^2 \geq 0 $$
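The equivalence used above can be written out step by step. The shorthand L = L(π) is used, and both L and $d_K$ are taken to be positive, as in the text.

```latex
\frac{2 L d_K - 1}{d_K L^2} \leq d_K
\;\Longleftrightarrow\;
2 L d_K - 1 \leq L^2 d_K^2
\qquad (\text{multiply both sides by } d_K L^2 > 0)

\;\Longleftrightarrow\;
0 \leq L^2 d_K^2 - 2 L d_K + 1 = (L d_K - 1)^2
```

Equality holds only when $L = d_K^{-1}$, so a single scalar L can match the Rao-Cramer bound in every eigen-direction only if all the $d_K$ coincide.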

Hence, the conclusion is the following inequality :

$$ [A^T [R^B(\pi)]^{-1} A]^{-1} \geq [A^T [R^A(\pi)]^{-1} A]^{-1} \qquad (a) $$

This inequality is true for any weighting vector A.

It expresses the exact loss in performance, asymptotically speaking, when
we use the recursive estimator instead of the Maximum Likelihood one.

In Fig. 2, the magnitude, $y_K$, of the K-th diagonal term of $F^{-1}$ is
plotted as a function of L :

$$ y_K = d_K^{-1} L^{-2} (2 L d_K - 1) $$

Fig. 2

$y_K(L)$ has a unique global maximum for $L = d_K^{-1}$.

The choice of the function L should be such as to make each $y_K$ as
close to its maximum value as possible, because then the Rao-Cramer lower
bound will be approached as closely as possible.

Obviously, we cannot maximize all $y_K$ simultaneously.

Hence, we choose to maximize their average :

$$ T(L) = (M-1)^{-1} \sum_{K=1}^{M-1} y_K(L) $$

We have :

$$ T(L) = \bar{d}^{-1} L^{-2} (2 L \bar{d} - 1) $$

where

$$ \bar{d}^{-1} = (M-1)^{-1} \sum_{K=1}^{M-1} d_K^{-1} $$

We have :

$$ d_1 \geq \bar{d} \geq d_{M-1} $$

The function T(L) has the same form as $y_K(L)$ if we put $d_K = \bar{d}$.

Hence, the choice of L that maximizes T(L) is :

$$ L_0(\pi) = \bar{d}^{-1} = (M-1)^{-1} \sum_{K=1}^{M-1} d_K^{-1} $$

Since $d_K^{-1}$ is an eigenvalue of $[D(\pi)]^{-1}$, we have :

$$ L_0(\pi) = (M-1)^{-1} \, \mathrm{trace} \, [D(\pi)]^{-1} $$

It is much easier to compute $L_0(\pi)$ for each $\pi \in I_M(A)$ by this
formula.
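The trace formula can be illustrated numerically for M = 3, where $D(\pi)$ is a 2 x 2 matrix and its inverse has a closed form. The sketch below assumes univariate Gaussian class densities, a midpoint-rule integration over a fixed range, and hypothetical function names; none of these choices comes from the paper.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def information_matrix(priors, densities, lo=-15.0, hi=15.0, steps=20000):
    """Midpoint-rule approximation of
    D_Ks = integral of (f_K - f_M)(f_s - f_M) / g(X | pi) dX."""
    m = len(densities)
    h = (hi - lo) / steps
    d = [[0.0] * (m - 1) for _ in range(m - 1)]
    for i in range(steps):
        x = lo + (i + 0.5) * h
        f = [dens(x) for dens in densities]
        g = sum(p * fk for p, fk in zip(priors, f))
        if g <= 0.0:  # all densities underflowed at this point
            continue
        diff = [fk - f[-1] for fk in f[:-1]]
        for k in range(m - 1):
            for s in range(m - 1):
                d[k][s] += diff[k] * diff[s] / g * h
    return d


def optimal_gain_2x2(d):
    """L_0 = (M-1)^{-1} trace(D^{-1}) for M = 3.
    For a 2x2 matrix, trace(D^{-1}) = trace(D) / det(D)."""
    det = d[0][0] * d[1][1] - d[0][1] * d[1][0]
    return 0.5 * (d[0][0] + d[1][1]) / det


if __name__ == "__main__":
    priors = [0.2, 0.5, 0.3]  # assumed true priors; the last entry is pi_M
    dens = [lambda x, mu=mu: normal_pdf(x, mu, 1.0) for mu in (-3.0, 0.0, 3.0)]
    D = information_matrix(priors, dens)
    print(D, optimal_gain_2x2(D))
```

Because $L_0$ is the average of the reciprocal eigenvalues, it always lies between $d_1^{-1}$ and $d_{M-1}^{-1}$, so no eigen-decomposition is needed to evaluate it.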

If noiseless observations were available, the relative frequency estimate
of the prior probabilities would have asymptotic error covariance matrix
$R^C(\pi)$, with elements

$$ c_{ij} = \pi_i (\delta_{ij} - \pi_j) $$

where

$$ \delta_{ij} = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases} $$

and $\pi_M = 1 - \sum_{j=1}^{M-1} \pi_j$.

The inverse matrix, $[R^C(\pi)]^{-1}$, has elements

$$ \delta_{ij} \pi_i^{-1} + \pi_M^{-1} $$


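Hence,

$$ A^T [R^C(\pi)]^{-1} A = \sum_{i=1}^{M-1} \sum_{j=1}^{M-1} a_i a_j (\delta_{ij} \pi_i^{-1} + \pi_M^{-1}) $$

$$ = \sum_{i=1}^{M-1} a_i^2 \pi_i^{-1} + \pi_M^{-1} \sum_{i=1}^{M-1} \sum_{j=1}^{M-1} a_i a_j $$

$$ = \sum_{i=1}^{M-1} a_i^2 \pi_i^{-1} + \pi_M^{-1} \left( \sum_{i=1}^{M-1} a_i \right)^2 $$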


We also have :

$$ A^T [R^A(\pi)]^{-1} A = \int_{E^n} g(X \mid \pi)^{-1} \left[ \sum_{i=1}^{M-1} a_i (f_i(X) - f_M(X)) \right]^2 dX = \sum_{i=1}^{M-1} \sum_{j=1}^{M-1} a_i a_j J_{ij}(\pi) $$

For

$$ A = (0 \ldots 0, \ a_K, \ 0 \ldots 0), \qquad a_K \neq 0 $$

we have

$$ A^T [R^A(\pi)]^{-1} A = a_K^2 J_{KK}(\pi) $$

and

$$ A^T [R^C(\pi)]^{-1} A = a_K^2 (\pi_K^{-1} + \pi_M^{-1}) $$

Using the result of Appendix IV, we have

$$ J_{KK}(\pi) \leq (\pi_K + \pi_M)^3 (\pi_K \pi_M)^{-1} \leq (\pi_K^{-1} + \pi_M^{-1}) $$

hence, for such A's we have

$$ [A^T [R^A(\pi)]^{-1} A]^{-1} \geq [A^T [R^C(\pi)]^{-1} A]^{-1} $$

I have not been able to prove the above inequality for general A . 

I conjecture that it is true in general, because the left side expresses the 
Rao- Cramer bound on estimating the mixture priors under noisy observations, 
while the right side expresses the variance of the relative frequency 
estimate under noiseless (or perfectly classified) observations. 

In any case, for a given weight vector A, we can compute both quadratic 
forms. Their relative sizes will give us a measure of performance loss 
due to noisy (unclassified) observations in estimating the prior probabilities. 

Conclusions 

We considered the problem of estimating the mixing prior probabilities when the probability density functions of a mixture are known.

It was shown that the maximum likelihood estimator is asymptotically efficient, but difficult to implement.

Hence a recursive estimator was proposed and analyzed. Using two stochastic approximation theorems due to Sacks, it was possible to show convergence to the true value.

Also, the asymptotic error variance was computed in closed form.

Because of the closed-form expression, it was possible to see the performance loss due to the use of a recursive algorithm.

For the binary mixture, it was possible to modify the recursive algorithm by means of a memoryless nonlinear transformation and achieve asymptotic efficiency. For the M-ary mixture with M > 2, use of a memoryless nonlinear transformation in the recursive algorithm decreased the error covariance, without achieving asymptotic efficiency.
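
To make the idea of a recursive estimator for the binary mixture concrete, here is a minimal stochastic-approximation sketch. It is not the algorithm analyzed in this work: the gain `q * (1 - q) / n`, the clipping constant `eps`, the Gaussian class densities, and the function names are all assumptions chosen for illustration. With this particular gain the update reduces to a running average of the posterior probability of class 1 under the current estimate.

```python
import math
import random

def normal_pdf(x, mean, var=1.0):
    """Density of a scalar Gaussian N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def simulate_mixture(n, prior, mean1, mean2, seed=0):
    """Draw n unclassified samples from prior*N(mean1,1) + (1-prior)*N(mean2,1)."""
    rng = random.Random(seed)
    return [rng.gauss(mean1 if rng.random() < prior else mean2, 1.0)
            for _ in range(n)]

def recursive_prior_estimate(samples, mean1, mean2, q0=0.5, eps=1e-3):
    """Update q after each sample using the score (f1 - f2) / g(x | q)."""
    q = q0
    for n, x in enumerate(samples, start=1):
        f1 = normal_pdf(x, mean1)
        f2 = normal_pdf(x, mean2)
        score = (f1 - f2) / (q * f1 + (1.0 - q) * f2)
        q += q * (1.0 - q) * score / n      # assumed gain q(1-q)/n
        q = min(1.0 - eps, max(eps, q))     # keep q inside (0, 1)
    return q
```

For example, `recursive_prior_estimate(simulate_mixture(20000, 0.3, 0.0, 3.0, seed=1), 0.0, 3.0)` processes the samples one at a time and needs no stored data beyond the current estimate.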
Appendix I

The purpose of the present appendix is to show that

$$J(\pi) \le [\pi(1 - \pi)]^{-1} \quad \text{for } \pi \neq 0, 1$$

and that the function $[J(\pi)]^{-1}$ is concave, for arbitrary densities $f_1(X)$, $f_2(X)$ that are nonzero for all $X \in E^n$. Also, a method will be given for computing $J(\pi)$ in the Gaussian case.

Let

$$s = \pi (1 - \pi)^{-1} .$$

Assume $\pi \neq 0, 1$. $J(\pi)$ can be written:

$$\begin{aligned}
J(\pi) &= (1+s) \int_{E^n} \big[ 1 - f_2(X)(f_1(X))^{-1} \big]^2 \big[ s + f_2(X)(f_1(X))^{-1} \big]^{-1} f_1(X)\, dX \\
&= (1+s) \int_{E^n} \Big\{ f_2(X)(f_1(X))^{-1} - (2+s) + (s+1)^2 \big[ s + f_2(X)(f_1(X))^{-1} \big]^{-1} \Big\} f_1(X)\, dX .
\end{aligned}$$

Hence

$$(1+s)^{-2} J(\pi) = -1 + (1+s)\, s^{-1} \int_{E^n} f_1(X)\, s \big[ s + f_2(X)(f_1(X))^{-1} \big]^{-1} dX .$$
The function $s \big[ s + f_2(X)(f_1(X))^{-1} \big]^{-1}$ is positive and upper bounded by 1. Hence, we can upper bound $J(\pi)$:

$$J(\pi) \le (1+s)^2\, s^{-1}$$

or:

$$J(\pi) \le [\pi(1-\pi)]^{-1}, \qquad [J(\pi)]^{-1} \ge \pi(1-\pi) .$$

It is seen that only for $\pi = 0$ or $1$ is there a possibility for $J(\pi)$ to be infinite.
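
The bound can be checked numerically for a concrete pair of densities. The sketch below assumes two scalar unit-variance Gaussians with hypothetical means 0 and 2, and evaluates $J(\pi)$ by a simple midpoint rule; the function names and integration limits are illustrative choices only.

```python
import math

def normal_pdf(x, mean, var=1.0):
    """Density of a scalar Gaussian N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo=-15.0, hi=15.0, steps=3000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def fisher_J(prior, mean1=0.0, mean2=2.0):
    """J(pi) = integral of (f1 - f2)^2 / (pi*f1 + (1-pi)*f2)."""
    def integrand(x):
        f1 = normal_pdf(x, mean1)
        f2 = normal_pdf(x, mean2)
        return (f1 - f2) ** 2 / (prior * f1 + (1.0 - prior) * f2)
    return integrate(integrand)

def upper_bound(prior):
    """The bound [pi (1 - pi)]^{-1} derived above."""
    return 1.0 / (prior * (1.0 - prior))

# Pairs (J, bound) for a few hypothetical priors.
table = [(fisher_J(p), upper_bound(p)) for p in (0.1, 0.3, 0.5, 0.9)]
```

Moving the two means further apart brings $J(\pi)$ closer to the bound, since the observations then classify themselves almost perfectly.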

A general method will now be given for computing $J(\pi)$ in the case of $f_1(X)$, $f_2(X)$ being multivariate Gaussian densities. The approach is an extension of a method in [2] and [4].

Let

$$f_1(X) = N(X, 0, R_1)$$
$$f_2(X) = N(X, M_0, R_2)$$

where $M_0 = M_2 - M_1$ is the difference of the mean vectors.

Let A be the n×n nonsingular matrix satisfying the relations:

$$A R_1 A^T = I$$

$$A R_2 A^T = \Lambda$$

where $\Lambda = \mathrm{diag}(\lambda_1 \cdots \lambda_n)$ and $\lambda_i$ are the eigenvalues of $R_2$ with respect to $R_1$. Hence, they satisfy the equation:

$$| R_2 - \lambda R_1 | = 0 .$$

Let $M = A M_0 = (m_1 \cdots m_n)^T$.

If we make the change of variables

$$Y = AX = (y_1 \cdots y_n)^T$$

the transformed densities are:

$$f_1(Y) = N(Y, 0, I)$$

$$f_2(Y) = N(Y, M, \Lambda) .$$

It is sufficient to compute the quantity:

$$\int_{E^n} f_1(Y) \big[ s + f_2(Y)(f_1(Y))^{-1} \big]^{-1} dY = E\Big\{ \big[ s + f_2(Y)(f_1(Y))^{-1} \big]^{-1} \,\Big|\, H_1 \Big\}$$

Let

$$z = \log\big[ f_2(Y)(f_1(Y))^{-1} \big] .$$

Then

$$z = \tfrac{1}{2} \sum_{k=1}^{n} \big[ y_k^2 - \lambda_k^{-1} (y_k - m_k)^2 - \log \lambda_k \big] .$$

The above conditional expectation can be written:

$$E\big\{ [s + e^z]^{-1} \mid H_1 \big\} .$$

Under hypothesis $H_1$, the $y_k$ are Gaussian, zero mean, unit variance, independent random variables.

We are now in a position to construct the characteristic function of $z$ under the hypothesis $H_1$.

Let

$$C(jw) = E\{ \exp(jwz) \mid H_1 \} .$$

Let

$$a_k = 1 - \lambda_k^{-1}$$

$$b_k = m_k (1 - \lambda_k)^{-1}$$

$$h_k = (a_k b_k)^2 (1 - a_k)^{-1} + \log \lambda_k, \qquad k = 1, \ldots, n .$$

With these definitions, the k-th term of $z$ is $\tfrac{1}{2} \big[ a_k y_k^2 - 2 a_k b_k y_k - h_k \big]$. Then,

$$C(jw) = \prod_{k=1}^{n} F_k(jw)$$

where

$$F_k(jw) = (1 - a_k jw)^{-1/2} \exp\Big[ -\tfrac{1}{2} (a_k b_k w)^2 (1 - a_k jw)^{-1} - \tfrac{1}{2} jw\, h_k \Big] .$$

The probability density function $g(z)$ of the random variable $z$ under hypothesis $H_1$ can be computed from $C(jw)$ by an inverse Fourier transform.

Let

$$q = 3.14159$$

$$g(z) = (2q)^{-1} \int_{-\infty}^{\infty} C(jw) \exp(-jwz)\, dw .$$

We can finally compute the desired quantity:

$$E\big\{ [s + e^z]^{-1} \mid H_1 \big\} = \int_{-\infty}^{\infty} g(z)\, [s + e^z]^{-1} dz .$$
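
The role of this expectation can be illustrated in one dimension. The sketch below assumes the transformed densities $f_1 = N(0, 1)$ and $f_2 = N(m, \lambda)$ with hypothetical values of $m$ and $\lambda$, takes $z = \log[f_2/f_1]$, and evaluates $E\{[s + e^z]^{-1} \mid H_1\}$ by direct numerical integration over $y$ rather than through the characteristic function. It then recovers $J(\pi)$ from the relation $(1+s)^{-2} J(\pi) = -1 + (1+s)\, E\{[s + e^z]^{-1} \mid H_1\}$ and compares it with the defining integral. All function names are illustrative.

```python
import math

def integrate(func, lo=-15.0, hi=15.0, steps=3000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def std_normal_pdf(y):
    """Density of N(0, 1), i.e. f1 in the transformed coordinates."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def log_ratio(y, m, lam):
    """z = log[f2(y) / f1(y)] for f1 = N(0, 1) and f2 = N(m, lam)."""
    return 0.5 * (y * y - (y - m) ** 2 / lam - math.log(lam))

def J_via_expectation(prior, m, lam):
    """J from (1+s)^-2 J = -1 + (1+s) E{[s + e^z]^-1 | H1}, s = pi/(1-pi)."""
    s = prior / (1.0 - prior)
    expectation = integrate(
        lambda y: std_normal_pdf(y) / (s + math.exp(log_ratio(y, m, lam))))
    return (1.0 + s) ** 2 * ((1.0 + s) * expectation - 1.0)

def J_direct(prior, m, lam):
    """J = integral of (f1 - f2)^2 / (pi*f1 + (1-pi)*f2)."""
    def integrand(y):
        f1 = std_normal_pdf(y)
        f2 = f1 * math.exp(log_ratio(y, m, lam))
        return (f1 - f2) ** 2 / (prior * f1 + (1.0 - prior) * f2)
    return integrate(integrand)
```

In higher dimensions the direct integral over $Y$ becomes expensive, which is where reducing the problem to the scalar variable $z$ pays off.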


We will now show that the function $[J(\pi)]^{-1}$ is concave.

This fact was noticed by Boes [1].

The second derivative of $[J(\pi)]^{-1}$ is:

$$\frac{d^2}{d\pi^2} [J(\pi)]^{-1} = \Big\{ -2 \int (f_1 - f_2)^2 g^{-1} dX \cdot \int (f_1 - f_2)^4 g^{-3} dX + 2 \Big[ \int (f_1 - f_2)^3 g^{-2} dX \Big]^2 \Big\} \Big/ \Big\{ \int (f_1 - f_2)^2 g^{-1} dX \Big\}^3$$

where

$$g = \pi f_1 + (1 - \pi) f_2 .$$
where 
Using Schwarz's inequality, we have:

$$\begin{aligned}
\Big\{ \int \big[ (f_1 - f_2) g^{-1} \big]^3 g\, dX \Big\}^2 &= \Big\{ \int \big[ (f_1 - f_2) g^{-1} \big] g^{1/2} \cdot \big[ (f_1 - f_2) g^{-1} \big]^2 g^{1/2}\, dX \Big\}^2 \\
&\le \int (f_1 - f_2)^2 g^{-2} g\, dX \int (f_1 - f_2)^4 g^{-4} g\, dX \\
&= \int (f_1 - f_2)^2 g^{-1} dX \int (f_1 - f_2)^4 g^{-3} dX .
\end{aligned}$$

Hence, the numerator of the expression for the second derivative is nonpositive.

Therefore,

$$\frac{d^2}{d\pi^2} [J(\pi)]^{-1} \le 0 \quad \text{for all } \pi \in [0, 1]$$

and hence $[J(\pi)]^{-1}$ is concave.

In Fig. 3, we show the shape of $[J(\pi)]^{-1}$ in relation to $\pi(1-\pi)$, which is a lower bound.

[Fig. 3: $[J(\pi)]^{-1} \ge \pi(1 - \pi)$, plotted against $\pi$ from 0 to 1.]
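
The shape described for Fig. 3 can be reproduced numerically. The following sketch assumes two scalar unit-variance Gaussians with hypothetical means 0 and 2, and tests concavity of $[J(\pi)]^{-1}$ through second differences on a coarse grid; the step `delta` and the function names are illustrative assumptions.

```python
import math

def normal_pdf(x, mean, var=1.0):
    """Density of a scalar Gaussian N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo=-15.0, hi=15.0, steps=3000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def inverse_fisher(prior, mean1=0.0, mean2=2.0):
    """[J(pi)]^{-1} for the binary mixture pi*f1 + (1-pi)*f2."""
    def integrand(x):
        f1 = normal_pdf(x, mean1)
        f2 = normal_pdf(x, mean2)
        return (f1 - f2) ** 2 / (prior * f1 + (1.0 - prior) * f2)
    return 1.0 / integrate(integrand)

def second_difference(prior, delta=0.05):
    """Discrete analogue of the second derivative of [J(pi)]^{-1}, times delta^2."""
    return (inverse_fisher(prior - delta) - 2.0 * inverse_fisher(prior)
            + inverse_fisher(prior + delta))

# Rows (pi, [J(pi)]^{-1}, pi(1-pi)) that could be plotted to mimic Fig. 3.
curve = [(p / 10.0, inverse_fisher(p / 10.0), (p / 10.0) * (1.0 - p / 10.0))
         for p in range(1, 10)]
```

Negative second differences across the grid are what concavity predicts, and each row of `curve` should show the middle entry no smaller than the last.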
Appendix II

We need to check whether Conditions 1-7 are satisfied by the class of density functions $f_1(X)$, $f_2(X)$ that satisfy Assumptions 1-2.

The derivatives appearing in Condition 1 are:

$$\frac{\partial^k}{\partial q^k} \log g(X|q) = (-1)^{k-1} (k-1)!\, [f_1(X) - f_2(X)]^k\, [q f_1(X) + (1-q) f_2(X)]^{-k}$$

for k = 1, 2, 3.

Using this formula, it is straightforward to check that

$$E\Big[ \frac{\partial}{\partial q} \log g(X|q) \Big]_{q=\pi} = 0$$

$$-E\Big[ \frac{\partial^2}{\partial q^2} \log g(X|q) \Big]_{q=\pi} = E\Big[ \Big( \frac{\partial}{\partial q} \log g(X|q) \Big)^2 \Big]_{q=\pi} = J(\pi)$$

where

$$J(\pi) = \int_{E^n} [f_1(X) - f_2(X)]^2\, [\pi f_1(X) + (1-\pi) f_2(X)]^{-1} dX .$$

Hence Conditions 1-4 are satisfied.

For Condition 5,

$$\Big| \frac{\partial^3}{\partial q^3} \log g(X|q) \Big| = 2 \left| \frac{f_1(X) - f_2(X)}{q f_1(X) + (1-q) f_2(X)} \right|^3 \le 2\, |f_1(X) - f_2(X)|^3\, [A(X)]^{-3}$$

where $A(X) = \min(f_1(X), f_2(X))$.

Since $A(X) > 0$ $\forall X \in E^n$, and $f_1(X)$, $f_2(X)$ are bounded, Condition 5 is satisfied.

For Condition 6, it is known that the Kullback-Leibler information number $I(q, \pi)$ has the following properties:

$$I(q, \pi) = 0 \ \text{ iff } \ g(X|\pi) = g(X|q) \quad \forall X \in E^n$$

and $I(q, \pi) > 0$ otherwise.

Because of the identifiability Assumption 2, we can have

$$g(X|\pi) = g(X|q) \quad \forall X \in E^n \quad \text{only for } \pi = q .$$

Hence, Assumption 2 implies that $I(q, \pi)$ achieves a unique minimum at $q = \pi$, and Condition 6 is satisfied.

The function

$$B(q, X) = \frac{\partial}{\partial q} \log g(X|q) = [f_1(X) - f_2(X)]\, [q f_1(X) + (1-q) f_2(X)]^{-1}$$

is continuous in $q$ for all $q \in [0, 1]$. Furthermore, $B(q, X)$ is bounded; therefore, it is uniformly continuous in $q$, and Condition 7 is satisfied.
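
The closed form for the derivatives of $\log g(X|q)$ used in this appendix can be checked against finite differences at a single point $X$. In the sketch below the density values `f1` and `f2` at that point, the value of `q`, and the step `h` are hypothetical numbers chosen for illustration.

```python
import math

def log_mixture(q, f1, f2):
    """log g(X | q) at a point X where the class densities take values f1, f2."""
    return math.log(q * f1 + (1.0 - q) * f2)

def derivative_formula(k, q, f1, f2):
    """(-1)^(k-1) (k-1)! [(f1 - f2) / (q f1 + (1-q) f2)]^k."""
    ratio = (f1 - f2) / (q * f1 + (1.0 - q) * f2)
    return (-1) ** (k - 1) * math.factorial(k - 1) * ratio ** k

def finite_difference(k, q, f1, f2, h=1e-2):
    """Central finite-difference estimate of the k-th derivative in q, k = 1, 2, 3."""
    g = lambda t: log_mixture(t, f1, f2)
    if k == 1:
        return (g(q + h) - g(q - h)) / (2.0 * h)
    if k == 2:
        return (g(q + h) - 2.0 * g(q) + g(q - h)) / h ** 2
    if k == 3:
        return (g(q + 2 * h) - 2.0 * g(q + h) + 2.0 * g(q - h) - g(q - 2 * h)) / (2.0 * h ** 3)
    raise ValueError("only k = 1, 2, 3 are supported")

# Hypothetical density values at one observation and a trial value of q.
checks = [(derivative_formula(k, 0.4, 0.8, 0.3), finite_difference(k, 0.4, 0.8, 0.3))
          for k in (1, 2, 3)]
```

Each pair in `checks` holds the closed-form value and its finite-difference estimate, which should agree up to the truncation error of the difference scheme.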
Appendix III

Condition 1a has already been shown to be valid. For Condition 2a, we have:

$$|M(q)| \le C_2\, |F(q)| .$$

$F(q)$ and $F'(q)$ will be shown to be bounded. Let

$$e^z = f_2(X)\, [f_1(X)]^{-1} .$$

We can write:

$$F(q) = \int_{E^n} f_1(X)\, [\pi + (1-\pi) e^z]\, [q + (1-q) e^z]^{-1} dX - \int_{E^n} f_2(X)\, [\pi e^{-z} + (1-\pi)]\, [q e^{-z} + (1-q)]^{-1} dX .$$

The second integral has the same form as the first one. If we interchange $f_1$ and $f_2$, $\pi$ and $1-\pi$, $q$ and $1-q$ in the second integral, we get the first one. Hence, it suffices to check the boundedness of the first integral only.

$$\int_{E^n} f_1(X)\, [\pi + (1-\pi) e^z]\, [q + (1-q) e^z]^{-1} dX = E\{ T(z, \pi, q) \mid H_1 \}$$
where

$$T(z, \pi, q) = [\pi + (1-\pi) e^z]\, [q + (1-q) e^z]^{-1} .$$

The derivative of $T$ with respect to $z$ is:

$$\frac{\partial T}{\partial z} = (q - \pi)\, e^z\, [q + (1-q) e^z]^{-2} .$$

Hence $T$ is a monotone function of $z$.

We have the following bounds:

$$\min\Big( \frac{\pi}{q}, \frac{1-\pi}{1-q} \Big) \le T(z, \pi, q) \le \max\Big( \frac{\pi}{q}, \frac{1-\pi}{1-q} \Big) .$$

Hence $F(q)$ is bounded for $q \neq 0, 1$.
The values $F(1)$, $F(0)$ are:

$$F(1) = -(1 - \pi)\, J(1)$$

$$F(0) = \pi\, J(0) .$$

By the definition of the interval $I(\epsilon)$, we see that $F(q)$ is bounded for all $q$ in the interval $I(\epsilon)$.
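
The monotonicity of $T(z, \pi, q) = [\pi + (1-\pi) e^z][q + (1-q) e^z]^{-1}$ in $z$, and the resulting bounds $\pi/q$ and $(1-\pi)/(1-q)$, are easy to confirm numerically. The values of `prior` and `q` and the grid of `z` values below are hypothetical.

```python
import math

def T(z, prior, q):
    """T(z, pi, q) = [pi + (1-pi) e^z] / [q + (1-q) e^z]."""
    ez = math.exp(z)
    return (prior + (1.0 - prior) * ez) / (q + (1.0 - q) * ez)

def T_bounds(prior, q):
    """Limits of T as z -> -inf and z -> +inf, returned as (lower, upper)."""
    lo, hi = sorted((prior / q, (1.0 - prior) / (1.0 - q)))
    return lo, hi

prior, q = 0.3, 0.6                 # hypothetical true prior and trial value
lower, upper = T_bounds(prior, q)
values = [T(z, prior, q) for z in range(-30, 31)]
# With q > pi the derivative (q - pi) e^z [q + (1-q) e^z]^-2 is positive,
# so the sampled values should be nondecreasing in z.
is_monotone = all(b >= a for a, b in zip(values, values[1:]))
```

The bounds blow up only as `q` approaches 0 or 1, which is why the argument restricts `q` to an interval that excludes the endpoints.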

In a similar manner, it can be shown that $F'(q)$ is bounded for $q \neq 0, 1$.

Hence,

$$|M(q)| \le C_2\, |F'(q)|\, |q - \pi| \le C_2 C_3\, |q - \pi|$$

for $q \neq 0, 1$, where $C_3 < +\infty$.
The first part of Condition 2a has been satisfied.

The second part is satisfied also, if we observe that $F(q)$ is a strictly monotone function of $q$.

Because of the boundedness of $F'(q)$, Condition 3a is also easily satisfied, with

$$\alpha = M'(\pi) = L(\pi)\, F'(\pi) .$$

Also we note that

$$F'(\pi) = -J(\pi) .$$

For Condition 4a, we must compute

$$\begin{aligned}
E[Z^2(X, q) \mid q] &= E[G(X, q) L(q) + M(q)]^2 \\
&= L^2(q)\, E[G^2(X, q) \mid q] - M^2(q) \\
&= L^2(q)\, [-F'(q)] - M^2(q) .
\end{aligned}$$

For $q \neq 0, 1$ the above quantity is finite, hence Condition 4a is satisfied. Also, we need to compute the quantity:

$$S(\pi) = \lim_{q \to \pi} E[Z^2(X, q) \mid q] = L^2(\pi)\, J(\pi) .$$
Appendix IV

In this appendix, we will seek upper bounds to the integrals $J_{sk}(\pi)$, $s, k = 1, \ldots, M-1$.

$$J_{sk}(\pi) = \int_{E^n} (f_s(X) - f_M(X))\, (f_k(X) - f_M(X)) \Big[ \sum_{m=1}^{M-1} \pi_m f_m(X) + \Big( 1 - \sum_{m=1}^{M-1} \pi_m \Big) f_M(X) \Big]^{-1} dX$$
Since every term of the mixture is nonnegative, $g(X|\pi) \ge \pi_k f_k(X) + \pi_M f_M(X) = (\pi_k + \pi_M)\, [p(k,M) f_k(X) + (1 - p(k,M)) f_M(X)]$. Hence,

$$J_{kk}(\pi) \le (\pi_k + \pi_M)^{-1} \int_{E^n} (f_k(X) - f_M(X))^2\, [p(k,M) f_k(X) + (1 - p(k,M)) f_M(X)]^{-1} dX$$

where

$$p(k,M) = \pi_k\, [\pi_k + \pi_M]^{-1} .$$

In Appendix I, an upper bound to this last integral has been found under the condition:

$$p(k,M) \neq 0, 1 .$$

Using this result, we have:

$$J_{kk}(\pi) \le (\pi_k + \pi_M)^{-1}\, [p(k,M)(1 - p(k,M))]^{-1}$$

or:

$$J_{kk}(\pi) \le (\pi_k + \pi_M)(\pi_k \pi_M)^{-1}$$

under the condition:

$$\pi_k \pi_M \neq 0 .$$

Using the Schwarz inequality, we can upper bound $J_{sk}(\pi)$:

$$\begin{aligned}
[J_{sk}(\pi)]^2 &= \Big\{ \int_{E^n} [g(X|\pi)]^{-1/2} [f_s(X) - f_M(X)] \cdot [g(X|\pi)]^{-1/2} [f_k(X) - f_M(X)]\, dX \Big\}^2 \\
&\le \int_{E^n} [g(X|\pi)]^{-1} [f_s(X) - f_M(X)]^2 dX \cdot \int_{E^n} [g(X|\pi)]^{-1} [f_k(X) - f_M(X)]^2 dX .
\end{aligned}$$

Hence

$$[J_{sk}(\pi)]^2 \le J_{kk}(\pi)\, J_{ss}(\pi)$$

$$|J_{sk}(\pi)| \le (\pi_k + \pi_M)^{1/2} (\pi_s + \pi_M)^{1/2} (\pi_k \pi_s)^{-1/2}\, \pi_M^{-1} .$$

This bound is valid for

$$\pi_s, \pi_k, \pi_M \neq 0 .$$

As a conclusion, we see that if $\pi$ lies in the interior of the set $I_M$, the functions $J_{sk}(\pi)$ are finite.

Hence, the part of Condition 3' related to the finiteness of the above functions is satisfied.
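
A numerical spot check of these bounds is sketched below for a three-class mixture. It assumes scalar unit-variance Gaussian class densities with hypothetical means and priors, uses zero-based class indices with class M listed last, and checks the Schwarz inequality $[J_{sk}]^2 \le J_{kk} J_{ss}$ together with the diagonal bound $J_{kk} \le \pi_k^{-1} + \pi_M^{-1}$. The function names are illustrative.

```python
import math

def normal_pdf(x, mean, var=1.0):
    """Density of a scalar Gaussian N(mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def integrate(func, lo=-15.0, hi=15.0, steps=3000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(func(lo + (i + 0.5) * h) for i in range(steps))

def fisher_entry(s, k, priors, means):
    """J_sk = integral of (f_s - f_M)(f_k - f_M) / g(x | pi); class M is last."""
    last = len(priors) - 1

    def integrand(x):
        dens = [normal_pdf(x, m) for m in means]
        g = sum(p * f for p, f in zip(priors, dens))
        return (dens[s] - dens[last]) * (dens[k] - dens[last]) / g

    return integrate(integrand)

priors = [0.25, 0.35, 0.40]   # hypothetical interior point of the prior set
means = [-1.5, 0.0, 1.5]      # hypothetical class means
J_ss = fisher_entry(0, 0, priors, means)
J_kk = fisher_entry(1, 1, priors, means)
J_sk = fisher_entry(0, 1, priors, means)
schwarz_holds = J_sk ** 2 <= J_ss * J_kk
diagonal_bound_holds = J_kk <= 1.0 / priors[1] + 1.0 / priors[2]
```

Letting any of the priors approach zero makes the diagonal bound grow without limit, which matches the restriction to the interior of the prior set.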
Appendix V

In the present Appendix, we will check the satisfaction of Conditions 1-5, based on the Assumptions 1-2. For Condition 1, we construct the scalar function

$$A(\lambda) = (Q - \pi)^T M[\pi + \lambda (Q - \pi)]$$

defined for $\lambda \in [0, 1]$.

We have

$$A(0) = (Q - \pi)^T M(\pi) = 0$$

$$A(1) = (Q - \pi)^T M(Q) .$$

The derivative of $A(\lambda)$ is:

$$A'(\lambda) = \sum_{s=1}^{M-1} (q_s - \pi_s)\, \frac{\partial}{\partial \lambda} M_s[\pi + \lambda (Q - \pi)] .$$

But:

$$M_s[\pi + \lambda(Q - \pi)] = -L(Q) \int_{E^n} g(X|\pi) \Big[ \lambda \sum_{k=1}^{M-1} [f_k(X) - f_M(X)]\, (q_k - \pi_k) + \sum_{k=1}^{M-1} [f_k(X) - f_M(X)]\, \pi_k + f_M(X) \Big]^{-1} [f_s(X) - f_M(X)]\, dX$$

Hence:

$$\frac{\partial}{\partial \lambda} M_s[\pi + \lambda(Q - \pi)] = L(Q) \int_{E^n} g(X|\pi)\, [g(X|\pi + \lambda(Q - \pi))]^{-2} \sum_{k=1}^{M-1} [f_k(X) - f_M(X)]\, (q_k - \pi_k)\, [f_s(X) - f_M(X)]\, dX .$$

Substituting, we have the following expression for $A'(\lambda)$:

$$A'(\lambda) = L(Q) \int_{E^n} g(X|\pi)\, [g(X|\pi + \lambda(Q - \pi))]^{-2} \Big[ \sum_{k=1}^{M-1} (q_k - \pi_k)\, [f_k(X) - f_M(X)] \Big]^2 dX$$

or, more compactly:

$$A'(\lambda) = L(Q) \int_{E^n} g(X|\pi)\, [g(X|\pi + \lambda(Q - \pi))]^{-2}\, [g(X|\pi) - g(X|Q)]^2 dX .$$

We have, therefore:

$$A'(\lambda) \ge 0 \quad \forall \lambda \in [0, 1] .$$

The case $A'(\lambda) = 0$ will occur iff $g(X|Q) = g(X|\pi)$ $\forall X \in E^n$. But, due to the identifiability assumption on $\{f_i(X)\}$, this would imply $Q = \pi$.

Hence, for $Q \neq \pi$ we have $A'(\lambda) > 0$ $\forall \lambda \in [0, 1]$.

Therefore,

$$A(1) = (Q - \pi)^T M(Q) > 0 \quad \forall Q \neq \pi$$

and Condition 1 is satisfied. For Condition 2, we apply the mean value theorem to the scalar function of $\lambda$, $M_k[\pi + \lambda(Q - \pi)]$, between the points $\lambda = 0$ and $\lambda = 1$:

$$M_k(Q) = M_k(\pi) + \sum_{s=1}^{M-1} (q_s - \pi_s)\, \frac{\partial}{\partial q_s} M_k[\pi + \lambda_k (Q - \pi)]$$

where $\lambda_k \in [0, 1]$.

Substituting, we have:

$$M_k(Q) = L(Q) \sum_{s=1}^{M-1} (q_s - \pi_s)\, C_{ks}$$

where

$$C_{ks} = \int_{E^n} g(X|\pi)\, [g(X|Q_k)]^{-2}\, [f_s(X) - f_M(X)]\, [f_k(X) - f_M(X)]\, dX$$

with

$$Q_k = \pi + \lambda_k (Q - \pi) = (p_1\ p_2\ \cdots\ p_{M-1})^T .$$

Also, let

$$p_M = 1 - \sum_{j=1}^{M-1} p_j .$$
Therefore,

$$[M_k(Q)]^2 = L^2(Q) \Big[ \sum_{s=1}^{M-1} (q_s - \pi_s)\, C_{ks} \Big]^2 \le L^2(Q) \Big[ \sum_{s=1}^{M-1} C_{ks}^2 \Big] \| Q - \pi \|^2$$

and

$$\| M(Q) \|^2 = \sum_{k=1}^{M-1} [M_k(Q)]^2 \le L^2(Q) \Big[ \sum_{k=1}^{M-1} \sum_{s=1}^{M-1} C_{ks}^2 \Big] \| Q - \pi \|^2 .$$

We can bound the quantities $C_{ks}$ with a method similar to the one used in Appendix IV. The result is:


C ks * * x k [ max (Pfc 1 ’ P M ) + max ( p s 1 ’ P M ) ] 


Therefore, for $Q \in I_M(\Delta)$, and with

$$K = \sum_{k,s} C_{ks}^2 < +\infty$$

we have satisfied Condition 2. For Condition 3, we use the second-order mean value theorem for the scalar function of $\lambda$, $M_k[\pi + \lambda(Q - \pi)]$, between the points 0 and 1:

$$M_k(Q) = \sum_{s=1}^{M-1} (q_s - \pi_s)\, \frac{\partial}{\partial q_s} M_k(\pi) + \tfrac{1}{2} \sum_{s=1}^{M-1} \sum_{j=1}^{M-1} (q_s - \pi_s)(q_j - \pi_j)\, \frac{\partial^2}{\partial q_s\, \partial q_j} M_k[\pi + \lambda_k (Q - \pi)]$$

where $\lambda_k \in [0, 1]$.

Hence, we can write:

$$M(Q) = B(Q - \pi) + (Q - \pi)^T W (Q - \pi)$$

where $B = L(\pi) D$ and $D$ is an $(M-1) \times (M-1)$ matrix with elements

$$D_{ij} = \int_{E^n} [g(X|\pi)]^{-1}\, [f_i(X) - f_M(X)]\, [f_j(X) - f_M(X)]\, dX .$$

The matrix $W$ is $(M-1) \times (M-1)$ and has as element $(s, j)$ the number:

$$\sum_{k=1}^{M-1} \frac{\partial^2}{\partial q_s\, \partial q_j} M_k[\pi + \lambda_k (Q - \pi)] .$$

It can be shown, again, that for $Q \in I_M(\Delta)$ the above terms are bounded, with methods similar to those of Appendix I.
Hence

$$\frac{\| (Q - \pi)^T W (Q - \pi) \|}{\| Q - \pi \|^2}$$

is upper bounded by a finite number.

Furthermore, let $Y = (y_1 \cdots y_{M-1})^T$ be an arbitrary vector, $\| Y \| \neq 0$.

Then

$$Y^T B\, Y = L(\pi) \int_{E^n} [g(X|\pi)]^{-1} \Big[ \sum_{k=1}^{M-1} y_k\, (f_k(X) - f_M(X)) \Big]^2 dX .$$

$Y^T B\, Y$ can be zero iff

$$\sum_{k=1}^{M-1} y_k f_k(X) - f_M(X) \sum_{k=1}^{M-1} y_k = 0 \quad \forall X \in E^n .$$

The identifiability of the set $\{f_i(X)\}$ makes this impossible.

Therefore, $B$ is positive definite. The above facts show that Condition 3 is satisfied.

We must compute

$$E[\, \| Z(X, Q) \|^2 \mid Q \,] = L^2(Q) \int_{E^n} g(X|\pi)\, [g(X|Q)]^{-2} \sum_{k=1}^{M-1} [f_k(X) - f_M(X)]^2 dX - \| M(Q) \|^2 .$$

The first integral can be upper bounded in the same manner as $C_{ks}$. We have:
We have : 

J g(X I tt) [g(X | Q)] [f k (X) ■ f M^ X )] dX s 
E n 

s max (q fc , q^ ) 

where

$$Q = (q_1 \cdots q_{M-1})^T$$

and

$$q_M = 1 - \sum_{j=1}^{M-1} q_j .$$

For

$$q_1, \ldots, q_{M-1}, q_M \neq 0$$

each term is bounded.

Hence, for $Q \in I_M(\Delta)$, the expected value of the norm of $Z(X, Q)$ is bounded. The matrix $S(\pi)$ has elements

$$S_{ij}(\pi) = L^2(\pi) \int_{E^n} [g(X|\pi)]^{-1}\, [f_i(X) - f_M(X)]\, [f_j(X) - f_M(X)]\, dX$$

or:

$$S(\pi) = L^2(\pi)\, D(\pi) .$$
It has been shown already that $D(\pi)$ is positive definite.

Hence Condition 4 is satisfied. Because of the nature of the algorithm, Condition 5 is easily shown to be satisfied.

For our case, we have:

$$B(\pi) = L(\pi)\, D(\pi)$$

$$S(\pi) = L^2(\pi)\, D(\pi) .$$

The matrix $S^*(\pi)$ is:

$$S^*(\pi) = P^{-1} S(\pi) P = P^{-1} L^2(\pi) D(\pi) P = L(\pi)\, P^{-1} B(\pi) P = L(\pi)\, \mathrm{diag}(b_1 \cdots b_{M-1}) .$$

Let $d_1 \ge d_2 \ge \cdots \ge d_{M-1}$ be the eigenvalues of $D(\pi)$. Then,

$$b_k = L(\pi)\, d_k$$

and

$$S^*(\pi) = L^2(\pi)\, \mathrm{diag}(d_1 \cdots d_{M-1}) .$$

The matrix $F$ is, therefore, diagonal:

$$F(\pi) = L^2(\pi)\, \mathrm{diag}\big[ (2 L(\pi) d_1 - 1)^{-1} d_1, \ \ldots, \ (2 L(\pi) d_{M-1} - 1)^{-1} d_{M-1} \big] .$$
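
The diagonal entries $L^2 d_k (2 L d_k - 1)^{-1}$ of $F(\pi)$ are simple to tabulate for a given scalar gain. The sketch below uses hypothetical eigenvalues and gains, and the function name and the guard requiring $2 L d_k > 1$ (so that every entry is positive and finite) are assumptions of this example. By elementary algebra each entry exceeds $1/d_k$ by $(L d_k - 1)^2 / [d_k (2 L d_k - 1)]$, so a single scalar gain can reach $1/d_k$ for every $k$ only when all eigenvalues coincide.

```python
def asymptotic_variances(gain, eigenvalues):
    """Diagonal of F: L^2 d_k / (2 L d_k - 1) for each eigenvalue d_k of D."""
    result = []
    for d in eigenvalues:
        if 2.0 * gain * d <= 1.0:
            # Assumed guard: the expression is only used where 2 L d_k > 1.
            raise ValueError("gain too small: need 2 * gain * d > 1")
        result.append(gain * gain * d / (2.0 * gain * d - 1.0))
    return result

eigenvalues = [4.0, 2.0, 1.5]       # hypothetical eigenvalues d_1 >= d_2 >= d_3
unit_gain = asymptotic_variances(1.0, eigenvalues)
# Gain matched to one eigenvalue at a time, to show the per-entry minimum 1/d_k.
matched = [asymptotic_variances(1.0 / d, [d])[0] for d in eigenvalues]
```

Here `matched` contains the values $1/d_k$, while `unit_gain` shows what one common gain achieves across all entries.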